In [1]:
import pandas as pd
import numpy as np
import folium

from gnact import utils, clust, plotting, network

import warnings
warnings.filterwarnings('ignore')

Demonstration of Plotting and Density Based Clustering Methods Using GNACT


Reducing dense geospatial datasets to the starts and stops of individual selectors challenges conventional GIS techniques. Geospatial Network Analysis with Clustering and Time (GNACT) is a Python package that helps parse and analyze geospatial trajectories.

In the following examples, we will use ship position data from AIS to track the MSC ARUSHI (MMSI: 636016432) during its voyages in 2017. We'll compare several density-based clustering methods, static trip segmentation based on known sites, and an implementation of dynamic trip segmentation from Scikit-Mobility. Using GNACT, we will plot the trajectories, calculate the precision, recall, and F1 measure against a "gold-standard" ground truth, and plot clusters in different ways to help an analyst understand how a particular clustering methodology or set of hyperparameters is working.

Plot of Raw Data

In [35]:
# load the position data
df_posits = pd.read_csv('df_posits_636016432.csv', parse_dates=['time'])
df_posits.head()
Out[35]:
id lat lon time
0 15867231 42.28535 -69.12919 2017-01-06 07:04:31
1 15867232 42.28576 -69.14447 2017-01-06 07:06:31
2 15867550 42.28601 -69.15519 2017-01-06 07:07:55
3 15867549 42.28627 -69.16453 2017-01-06 07:09:08
4 15867237 42.28669 -69.17284 2017-01-06 07:10:13
In [36]:
# plot with Folium
m = folium.Map(location=[df_posits.lat.median(), df_posits.lon.median()],
               zoom_start=4, tiles='OpenStreetMap')
points = list(zip(df_posits.lat, df_posits.lon))
folium.PolyLine(points).add_to(m)
m
Out[36]:

Use the World Port Index as Reference Sites

This list is not exhaustive, but it's a good example of a real-world reference dataset where most of the major sites are known but many smaller sites are not.

In [4]:
df_sites = pd.read_csv('wpi_clean.csv')
df_sites.head()
Out[4]:
site_id site_name lat lon region
0 61090 SHAKOTAN 43.866667 146.833333 NaN
1 61110 MOMBETSU KO 44.350000 143.350000 NaN
2 5750 CHARLOTTETOWN 46.233333 -63.133333 NaN
3 61120 ABASHIRI KO 44.016667 144.283333 NaN
4 61130 NEMURO KO 43.333333 145.583333 NaN

Finding Stops/Clusters at Known Ports Using Static Trip Segmentation

We will generate clusters for all positions that spend a minimum amount of time within a certain distance of any known site. This is known as static trip segmentation in the literature, and will have the lowest false positive rate of any method because all positions clustered must be within a certain distance of a known port.

I previously developed a static trip segmentation methodology during a Directed Studies on Network Analysis. By applying this methodology to a geospatial dataset with a known set of ports, I generated a network map with each stop representing a node and travel between ports as edges.

Calculate Nearest Site for Each Position and Create a List of Stops

The first step in static trip segmentation is to calculate the nearest known site for each position. Here we will apply the approach against a DataFrame using a custom function.

We will use a distance threshold of 5km and loiter time of 6 hours (360 minutes). This means a cargo ship must spend at least 6 hours within 5km of a known port to be counted as making a stop.

In [5]:
dist_threshold_km = 5
loiter_time_mins = 360

df_nn = clust.calc_nn(df_posits, df_sites)
df_nn.head()
Out[5]:
id nearest_site_id dist_km
0 15867231 7290 120.10
1 15867232 7290 118.45
2 15867550 7290 117.30
3 15867549 7290 116.29
4 15867237 7290 115.40
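
The nearest-site lookup above can be illustrated with a brute-force haversine search. This is only a minimal sketch under stated assumptions: haversine_km and nearest_site_sketch are hypothetical helpers written for this example, not GNACT's clust.calc_nn, whose internals are not shown in this notebook. The column names follow the tables above.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_site_sketch(df_posits, df_sites):
    """Hypothetical helper: nearest site and distance for every position."""
    # broadcast positions (rows) against sites (columns)
    dist = haversine_km(df_posits['lat'].to_numpy()[:, None],
                        df_posits['lon'].to_numpy()[:, None],
                        df_sites['lat'].to_numpy()[None, :],
                        df_sites['lon'].to_numpy()[None, :])
    idx = dist.argmin(axis=1)
    return pd.DataFrame({
        'id': df_posits['id'].to_numpy(),
        'nearest_site_id': df_sites['site_id'].to_numpy()[idx],
        'dist_km': dist[np.arange(len(idx)), idx].round(2),
    })
```

The full distance matrix grows with positions times sites, so for large inputs a spatial index (for example a ball tree with the haversine metric) would likely be a better fit than this brute-force version.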

Building "Ground Truth" From Static Trip Segmentation Results

Now we can apply static trip segmentation against this data and use it as our "ground truth".

In [7]:
# determine the "ground truth" for this sample
df_stops = network.calc_static_seg(df_posits, df_nn, df_sites, 
                                   dist_threshold_km, loiter_time_mins)
# plot results 
plotting.plot_stops(df_stops, df_posits)
Plotted 29 total sites.
Out[7]:
In [8]:
df_stops.head()
Out[8]:
node destination arrival_time depart_time time_diff position_count site_id site_name lat lon region
0 7250 0.0 2017-01-06 12:11:38 2017-01-07 07:25:59 0 days 19:14:21 396.0 7250 BOSTON 42.350000 -71.050000 NaN
1 7810 7850.0 2017-01-08 11:23:07 2017-01-09 07:05:57 0 days 19:42:50 405.0 7810 NEWARK 40.700000 -74.150000 NaN
2 8120 0.0 2017-01-10 01:31:26 2017-01-10 15:20:21 0 days 13:48:55 371.0 8120 GLOUCESTER 39.900000 -75.133333 NaN
3 9985 0.0 2017-01-13 10:06:34 2017-01-14 21:51:17 1 days 11:44:43 3.0 9985 FREEPORT 26.516667 -78.783333 NaN
4 7250 0.0 2017-02-21 10:57:41 2017-02-21 21:28:35 0 days 10:30:54 228.0 7250 BOSTON 42.350000 -71.050000 NaN
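
The idea behind static trip segmentation can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the implementation of network.calc_static_seg: static_seg_sketch is a hypothetical function, and it assumes one DataFrame that already holds time, nearest_site_id, and dist_km for each position.

```python
import pandas as pd


def static_seg_sketch(df, dist_threshold_km, loiter_time_mins):
    """Hypothetical helper: find stops as long runs of positions near one site."""
    df = df.sort_values('time').reset_index(drop=True)
    # site id when within the distance threshold, otherwise -1 for "under way"
    at_site = df['nearest_site_id'].where(df['dist_km'] <= dist_threshold_km, -1)
    # a new run starts whenever the site label changes
    run_id = (at_site != at_site.shift()).cumsum()
    runs = (df.assign(site_id=at_site, run_id=run_id)
              .groupby('run_id')
              .agg(site_id=('site_id', 'first'),
                   arrival_time=('time', 'min'),
                   depart_time=('time', 'max'),
                   position_count=('time', 'size')))
    # drop the runs spent away from any known site
    runs = runs[runs['site_id'] != -1].copy()
    runs['time_diff'] = runs['depart_time'] - runs['arrival_time']
    # keep only runs that loiter long enough to count as a stop
    long_enough = runs['time_diff'] >= pd.Timedelta(minutes=loiter_time_mins)
    return runs[long_enough].reset_index(drop=True)
```

With dist_threshold_km=5 and loiter_time_mins=360, a run only counts as a stop if the first and last positions within 5 km of the same site are at least six hours apart.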

After plotting the ports visited, it's clear that there was activity near Savannah that was not recorded. It turns out the port near Savannah where the MSC ARUSHI stopped was not in the database. We can add the port manually, regenerate our nearest-neighbor DataFrame, and recompute our statistics.

In [9]:
# manually create the site
savannah_site = {'site_id':3, 'site_name': 'SAVANNAH_MANUAL_1', 'lat': 32.121167, 'lon':-81.130085, 
               'region':'East_Coast'}
# add the site to the df_sites
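# DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
df_sites = pd.concat([df_sites, pd.DataFrame([savannah_site])], ignore_index=True) # add savannah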
# recompute the nearest neighbors
df_nn = clust.calc_nn(df_posits, df_sites)
In [10]:
# determine the "ground truth" for this sample
df_stops = network.calc_static_seg(df_posits, df_nn, df_sites, 
                                   dist_threshold_km, loiter_time_mins)
# plot results 
plotting.plot_stops(df_stops, df_posits)
Plotted 31 total sites.
Out[10]:

Review "Ground Truth"

Now our new Savannah port is correctly identified as a ground truth cluster. Next, we can use our code to generate clusters and compare them to the ground truth.

In [11]:
df_stops.groupby('site_name').agg('count').iloc[:,0]
Out[11]:
site_name
BOSTON                8
FREEPORT              2
GLOUCESTER           11
NEWARK                8
SAVANNAH_MANUAL_1     2
Name: node, dtype: int64
In [12]:
df_stops.drop(['lat','lon','region','destination', 'position_count', 'node'], axis=1).head(5)
Out[12]:
arrival_time depart_time time_diff site_id site_name
0 2017-01-06 12:11:38 2017-01-07 07:25:59 0 days 19:14:21 7250 BOSTON
1 2017-01-08 11:23:07 2017-01-09 07:05:57 0 days 19:42:50 7810 NEWARK
2 2017-01-10 01:31:26 2017-01-10 15:20:21 0 days 13:48:55 8120 GLOUCESTER
3 2017-01-13 10:06:34 2017-01-14 21:51:17 1 days 11:44:43 9985 FREEPORT
4 2017-02-21 10:57:41 2017-02-21 21:28:35 0 days 10:30:54 7250 BOSTON

Apply Algorithms, Get Clusters, Compare to Ground Truth, and Plot Results

Plot of DBSCAN with Low Parameters

In [14]:
# execute clustering algo with hyperparameters
df_clusts = clust.calc_clusts(df_posits, eps_km=1, min_samp=50, method='dbscan')
plotting.analyze_clusters(df_posits, df_clusts, df_stops, dist_threshold_km)
{'precision': 0.2143, 'recall': 0.0968, 'f1': 0.1333}
Plotted 33 predicted clusters and 31 ground truth clusters.
Out[14]:
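
The three scores in the dictionary above are related by the standard definitions of precision, recall, and F1. The sketch below is a minimal illustration; prf1 is a hypothetical helper, and the counts are hypothetical, chosen only because they reproduce the reported figures. How plotting.analyze_clusters matches predicted clusters to ground truth stops is not shown in this notebook.

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'precision': round(precision, 4),
            'recall': round(recall, 4),
            'f1': round(f1, 4)}


# hypothetical counts (assumption): 3 matched stops, 11 spurious clusters, 28 missed stops
print(prf1(tp=3, fp=11, fn=28))
# {'precision': 0.2143, 'recall': 0.0968, 'f1': 0.1333}
```

Because F1 is the harmonic mean of precision and recall, a method with perfect precision but low recall still scores poorly, which is worth keeping in mind when reading the results that follow.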

Plot of DBSCAN with High Parameters

In [16]:
# execute clustering algo with hyperparameters
df_clusts = clust.calc_clusts(df_posits, eps_km=3, min_samp=2000, method='dbscan')
plotting.analyze_clusters(df_posits, df_clusts, df_stops, dist_threshold_km)
{'precision': 1.0, 'recall': 0.0968, 'f1': 0.1765}
Plotted 7 predicted clusters and 31 ground truth clusters.
Out[16]:
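
To make the roles of eps_km and min_samp concrete, here is a sketch that calls scikit-learn's DBSCAN directly with the haversine metric. It assumes an epsilon given in kilometers is converted to radians by dividing by the Earth's radius. Whether clust.calc_clusts does the same internally is an assumption, and dbscan_km_sketch is a hypothetical name used only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def dbscan_km_sketch(df, eps_km, min_samp):
    """Hypothetical helper: DBSCAN on lat/lon with epsilon expressed in km."""
    # the haversine metric expects [lat, lon] in radians
    coords = np.radians(df[['lat', 'lon']].to_numpy())
    labels = DBSCAN(eps=eps_km / EARTH_RADIUS_KM, min_samples=min_samp,
                    metric='haversine', algorithm='ball_tree').fit_predict(coords)
    out = df.assign(clust_id=labels)
    # label -1 marks noise points that belong to no cluster
    return out[out['clust_id'] != -1]
```

A small eps_km with a low min_samp lets short pauses form clusters, while a very high min_samp only keeps places where the ship accumulated thousands of positions. That trade-off is consistent with the precision and recall differences between the low and high parameter runs.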

DBSCAN with Tuned Parameters

In [17]:
df_clusts = clust.calc_clusts(df_posits, eps_km=1, min_samp=250, method='dbscan')
plotting.analyze_clusters(df_posits, df_clusts, df_stops, dist_threshold_km)
{'precision': 0.5714, 'recall': 0.129, 'f1': 0.2105}
Plotted 9 predicted clusters and 31 ground truth clusters.
Out[17]:

DBSCAN With Speed Filter

In [19]:
# enhance the df with speed, course and additional trajectory info
df_traj_enhanced = utils.traj_enhance_df(df_posits)
# filter down to points below a certain speed; copy so later column assignments do not hit a slice
df_slow_posits = df_traj_enhanced[df_traj_enhanced['speed_kts'] < 1].copy()
# cluster only the slow points
df_clusts = clust.calc_clusts(df_slow_posits, eps_km=2, min_samp=50, method='dbscan')

plotting.analyze_clusters(df_posits, df_clusts, df_stops, dist_threshold_km)
{'precision': 0.625, 'recall': 0.1613, 'f1': 0.2564}
Plotted 8 predicted clusters and 31 ground truth clusters.
Out[19]:
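
The speed filter depends on a per-position speed estimate. The sketch below shows one way such a column could be derived from consecutive fixes. It is an assumption-laden illustration, not the code behind utils.traj_enhance_df: add_speed_kts and haversine_km are hypothetical helpers, and only the speed_kts column name is taken from the cell above.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
KM_PER_NM = 1.852            # kilometers per nautical mile


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_speed_kts(df):
    """Hypothetical helper: speed over ground in knots from consecutive positions."""
    df = df.sort_values('time').reset_index(drop=True)
    # distance from the previous fix to the current one (NaN for the first row)
    dist_km = haversine_km(df['lat'].shift(), df['lon'].shift(), df['lat'], df['lon'])
    hours = df['time'].diff().dt.total_seconds() / 3600
    df['speed_kts'] = (dist_km / KM_PER_NM) / hours
    return df
```

Filtering with df[df['speed_kts'] < 1] then keeps only positions where the ship was nearly stationary, so DBSCAN never sees the dense tracks laid down in transit.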

OPTICS with Tuned Parameters

In [20]:
df_clusts = clust.calc_clusts(df_posits, eps_km=5, min_samp=200, method='optics')
plotting.analyze_clusters(df_posits, df_clusts, df_stops, dist_threshold_km)
{'precision': 0.88, 'recall': 0.7097, 'f1': 0.7857}
Plotted 27 predicted clusters and 31 ground truth clusters.
Out[20]:
In [21]:
df_centers = clust.calc_centers(df_clusts)
df_centers.head()
Out[21]:
clust_id time_min time_max total_clust_count average_lat average_lon average_dist_from_center time_diff
0 0 2017-01-06 13:30:00 2017-12-27 12:00:00 364 42.342169 -71.019036 0.000683 354 days 22:30:00
1 1 2017-02-21 21:08:00 2017-12-27 22:45:00 347 42.342170 -71.019128 0.000697 309 days 01:37:00
2 2 2017-01-06 12:30:00 2017-06-01 22:52:00 497 42.342071 -71.019059 0.001078 146 days 10:22:00
3 3 2017-04-12 08:41:00 2017-08-08 23:33:00 261 42.342166 -71.019269 0.000770 118 days 14:52:00
4 4 2017-01-08 07:25:00 2017-09-15 03:37:00 450 40.471352 -73.624453 3.239371 249 days 20:12:00
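
A per-cluster summary like the one above can be approximated with a pandas groupby. This is a sketch only: centers_sketch is a hypothetical function, it assumes the clustered DataFrame has id, clust_id, lat, lon, and time columns, and it omits the average_dist_from_center column because the notebook does not show how that value is computed.

```python
import pandas as pd


def centers_sketch(df_clusts):
    """Hypothetical helper: time span, size and mean position of each cluster."""
    return (df_clusts.groupby('clust_id')
            .agg(time_min=('time', 'min'),
                 time_max=('time', 'max'),
                 total_clust_count=('id', 'count'),
                 average_lat=('lat', 'mean'),
                 average_lon=('lon', 'mean'))
            # span between the first and last position in the cluster
            .assign(time_diff=lambda d: d['time_max'] - d['time_min'])
            .reset_index())
```

A time_diff of several months for a single cluster, as in the table above, suggests that a purely spatial cluster can merge many separate visits to the same berth.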

ST_DBSCAN with Tuned Parameters

In [23]:
# Processing is upwards of an hour...
#df_clusts = stdbscan.ST_DBSCAN(df_posits, spatial_threshold=3, temporal_threshold=600, min_neighbors=100)
#df_clusts.to_csv('../st_dbscan_results.csv', index=False)
In [24]:
df_clusts = pd.read_csv('st_dbscan_results.csv', parse_dates=['time'])
df_clusts = df_clusts[df_clusts['clust_id'] != -1]

plotting.analyze_clusters(df_posits, df_clusts, df_stops, dist_threshold_km)
{'precision': 0.8529, 'recall': 0.9355, 'f1': 0.8923}
Plotted 36 predicted clusters and 31 ground truth clusters.
Out[24]:

Experimentation with Double Clustering Approaches

By clustering raw positions to centerpoints, and then clustering those centerpoints, we can use lower thresholds when clustering each ship's positions. We can then cluster the resulting centerpoints with a minimum sample size set above a certain noise threshold.

In [26]:
from sklearn.cluster import DBSCAN

df_clusts = clust.calc_clusts(df_posits, eps_km=3, min_samp=2000, method='dbscan')
#df_clusts = clust.calc_clusts(df_posits, eps_km=5, min_samp=200, method='optics')

# need new unique cluster ids across each uid
clust_count = 0
# will hold results of second round temporal clustering
second_round = []

# begin iteration. Look at each cluster in turn from first round results and cluster across time
clusters = df_clusts['clust_id'].unique()
for c in clusters:
    df_c = df_clusts[df_clusts['clust_id'] == c]
    # convert time from nanoseconds since the epoch to minutes
    X = (df_c['time'].astype('int64').values / ((10**9)*60)).reshape(-1, 1)
    x_id = df_c['id'].astype('int64').values
    # cluster again using DBSCAN with a temporal epsilon (minutes) in one dimension
    dbscan = DBSCAN(eps=600, min_samples=2, algorithm='kd_tree',
                    metric='euclidean', n_jobs=1)
    dbscan.fit(X)
    # gather the output as a dataframe and drop the noise points
    df_clusts2 = pd.DataFrame({'id': x_id, 'clust_id': dbscan.labels_})
    df_clusts2 = df_clusts2[df_clusts2['clust_id'] != -1]
    for c2 in df_clusts2['clust_id'].unique():
        # copy so the new cluster id is not written to a slice
        df_c2 = df_clusts2[df_clusts2['clust_id'] == c2].copy()
        # need to assign a new cluster id
        df_c2['clust_id'] = clust_count
        second_round.append(df_c2)
        clust_count += 1

# DataFrame.append was removed in pandas 2.0, so concatenate once after the loop
df_second_round = pd.concat(second_round, ignore_index=True)
df_second_results = pd.merge(df_second_round, df_clusts.drop('clust_id', axis=1), how='left', on='id')

# note: this call scores the first-round clusters (df_clusts); the second-round output is in df_second_results
plotting.analyze_clusters(df_posits, df_clusts, df_stops, dist_threshold_km)
{'precision': 1.0, 'recall': 0.0968, 'f1': 0.1765}
Plotted 7 predicted clusters and 31 ground truth clusters.
Out[26]:
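
With min_samples=2 in a single time dimension, the second-round DBSCAN above behaves like a gap-based split: consecutive positions stay in the same group while the gap between them is at most eps minutes, and isolated positions become noise. The sketch below expresses that equivalent view in plain pandas. split_visits_by_gap is a hypothetical helper written for this example and is not part of GNACT.

```python
import pandas as pd


def split_visits_by_gap(times, max_gap_mins=600):
    """Hypothetical helper: split timestamps into visits separated by long gaps."""
    t = pd.Series(pd.to_datetime(times)).sort_values().reset_index(drop=True)
    # a new visit starts whenever the gap to the previous position exceeds the threshold
    new_visit = t.diff() > pd.Timedelta(minutes=max_gap_mins)
    visit_id = new_visit.cumsum()
    # visits with a single position correspond to DBSCAN noise when min_samples=2
    sizes = visit_id.map(visit_id.value_counts())
    out = pd.DataFrame({'time': t, 'visit_id': visit_id})
    return out[sizes >= 2].reset_index(drop=True)
```

Applied to the positions of one spatial cluster, this would separate repeat visits to the same port, which is the purpose of the second clustering round.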

Integration of Scikit-Mobility and Dynamic Segmentation of Trips

Scikit-Mobility provides additional plotting and packaging tools to parse a geospatial dataset into "trips" based on "stops" along each UID's path.

In [28]:
import skmob
In [29]:
#df_posits = clust.get_uid_posits(('636016432',), engine, end_time='2018-01-01')
#df_posits['uid'] = '636016432'
tdf = skmob.TrajDataFrame(df_posits, latitude='lat', longitude='lon', datetime='time')
tdf.plot_trajectory(tiles='OpenStreetMap', zoom=4)
Out[29]:

Scikit-Mobility (skmob) has a "stop detection" algorithm that identifies stops based on distance, duration, and maximum speed, and includes an escape clause for gaps in the data.

In [30]:
from skmob.preprocessing import detection
stdf = detection.stops(tdf, minutes_for_a_stop=360, spatial_radius_km=2, leaving_time=True, 
                       no_data_for_minutes=360, min_speed_kmh=70)

print('Points of the original trajectory:\t%s'%len(tdf))
print('Points of stops:\t\t\t%s'%len(stdf))

m = stdf.plot_trajectory(max_users=1, start_end_markers=False, tiles='OpenStreetMap', zoom=4)
stdf.plot_stops(max_users=1, map_f=m)
Points of the original trajectory:	55923
Points of stops:			34
Out[30]:
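
The distance and duration logic behind this kind of stop detection can be sketched as a single pass over the time-ordered positions. This is a simplified illustration and not skmob's implementation: stops_sketch and haversine_km are hypothetical helpers, the speed and data-gap checks are left out, and leaving_datetime here is the last position inside the stop, which may differ from the value skmob reports.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def stops_sketch(df, minutes_for_a_stop, spatial_radius_km):
    """Hypothetical helper: runs of positions that stay near an anchor point long enough."""
    df = df.sort_values('time').reset_index(drop=True)
    lat, lon, t = df['lat'].to_numpy(), df['lon'].to_numpy(), df['time']
    min_duration = pd.Timedelta(minutes=minutes_for_a_stop)
    stops, i, n = [], 0, len(df)
    while i < n:
        j = i + 1
        # extend the run while positions stay within the radius of anchor point i
        while j < n and haversine_km(lat[i], lon[i], lat[j], lon[j]) <= spatial_radius_km:
            j += 1
        if t[j - 1] - t[i] >= min_duration:
            stops.append({'lat': lat[i:j].mean(), 'lon': lon[i:j].mean(),
                          'datetime': t[i], 'leaving_datetime': t[j - 1]})
            i = j       # continue after the stop
        else:
            i += 1      # not a stop, try the next anchor point
    return pd.DataFrame(stops)
```

Unlike static trip segmentation, nothing here refers to a list of known sites, so stops at ports missing from the reference data can still be found.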
In [31]:
stdf.head()
Out[31]:
id lat lng datetime time_rounded time_diff time_diff_hours cog dist_nm speed_kts leaving_datetime
0 15893550 42.34207 -71.019040 2017-01-06 12:12:00 2017-01-07 07:34:00 00:01:00 0.016667 95.02 0.137256 8.235332 2017-01-07 07:35:00
1 13723030 40.68117 -74.146530 2017-01-08 11:29:00 2017-01-09 07:21:00 00:01:00 0.016667 201.80 0.130001 7.800066 2017-01-09 07:22:00
2 13843630 39.89867 -75.135875 2017-01-10 01:42:00 2017-01-10 14:50:00 00:01:00 0.016667 50.46 0.035844 2.150660 2017-01-10 14:51:00
3 31490700 42.34207 -71.019060 2017-02-21 10:55:00 2017-02-21 21:37:00 00:01:00 0.016667 91.94 0.177228 10.633680 2017-02-21 21:38:00
4 30552227 40.68529 -74.153170 2017-02-23 04:51:00 2017-02-24 00:15:00 00:01:00 0.016667 203.20 0.160722 9.643336 2017-02-24 00:16:00

We can then use DBSCAN to cluster the stops to find "destinations" frequently visited.

In [32]:
from skmob.preprocessing import detection, clustering
cstdf = clustering.cluster(stdf, cluster_radius_km=1, min_samples=1)
cstdf.head()
Out[32]:
id lat lng datetime time_rounded time_diff time_diff_hours cog dist_nm speed_kts leaving_datetime cluster
0 15893550 42.34207 -71.019040 2017-01-06 12:12:00 2017-01-07 07:34:00 00:01:00 0.016667 95.02 0.137256 8.235332 2017-01-07 07:35:00 2
1 13723030 40.68117 -74.146530 2017-01-08 11:29:00 2017-01-09 07:21:00 00:01:00 0.016667 201.80 0.130001 7.800066 2017-01-09 07:22:00 1
2 13843630 39.89867 -75.135875 2017-01-10 01:42:00 2017-01-10 14:50:00 00:01:00 0.016667 50.46 0.035844 2.150660 2017-01-10 14:51:00 0
3 31490700 42.34207 -71.019060 2017-02-21 10:55:00 2017-02-21 21:37:00 00:01:00 0.016667 91.94 0.177228 10.633680 2017-02-21 21:38:00 2
4 30552227 40.68529 -74.153170 2017-02-23 04:51:00 2017-02-24 00:15:00 00:01:00 0.016667 203.20 0.160722 9.643336 2017-02-24 00:16:00 1
In [33]:
m = cstdf.plot_trajectory(max_users=1, start_end_markers=False, tiles='OpenStreetMap', zoom=4)
cstdf.plot_stops(max_users=1, map_f=m)
Out[33]:

Now we can add it to our existing function wrappers and compute the statistics.

In [34]:
df_clusts = clust.calc_clusts(df_posits, eps_km=3, eps_time=360, method='dynamic')

plotting.analyze_clusters(df_posits, df_clusts, df_stops, dist_threshold_km)
{'precision': 0.8824, 'recall': 0.9677, 'f1': 0.9231}
Plotted 34 predicted clusters and 31 ground truth clusters.
Out[34]: